Abstract

A survey is made of several aspects of the dynamics of networks, with 
special emphasis on unsupervised learning processes, non-Gaussian data 
analysis and pattern recognition in networks with complex nodes. 
1 Recurrent networks. Dynamics and applications

There are three large classes of neural networks. One is the class of multilayered
feedforward networks which, through supervised learning, are used to approximate
nonlinear functions. The second is the class of relaxation networks with
symmetric synaptic connections, like the Hopfield network. Under time evolution
the relaxation networks evolve to fixed points which, in successful applications,
are identified with memorized patterns. The last class includes all
networks with arbitrary connections which are neither completely feedforward
nor symmetric. They are called recurrent networks.

Recurrent networks exhibit a complex variety of temporal behavior and are
increasingly being proposed for use in many engineering applications [1] [2] [3]
[4]. In contrast to feedforward or relaxation networks, the rich dynamical
structure of recurrent networks makes them a natural choice to process temporal
information. Even for the learning of nonlinear functions the feedback structure
improves the reproduction of discontinuities or large derivative regions [5]. Also,
in some cases, the feedback structure is a way to enhance, through a choice of
architecture, the sensitivity of the network to particular features to be detected [6].
The learning algorithms used for feedforward networks may be generalized to
the recurrent case [7] [8] [9] [10]. Some of this will be covered by another speaker
at this conference.

Feedback loops and the coexistence of quiescent, oscillatory and chaotic behavior
are also present in biological systems, and only recurrent networks are
appropriate models for these phenomena [11].

The whole field of recurrent networks, both for engineering and biological
applications, is developing in many directions. Their global analysis is rather
involved [12]. Their characterization as dynamical systems is sharpened by a
decomposition theorem. Continuous state - continuous time neural networks,
as well as many other systems [13], may be written in the Cohen-Grossberg [14]
form

$$ \frac{dx_i}{dt} = a_i(x_i) \left\{ b_i(x_i) - \sum_j W_{ij}\, d_j(x_j) \right\} \tag{1} $$
Dynamical systems of this type have a decomposition property. Define 



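$$ W_{ij} = W_{ij}^{(S)} + W_{ij}^{(A)}, \qquad W_{ij}^{(S)} = \tfrac{1}{2}\left(W_{ij} + W_{ji}\right), \qquad W_{ij}^{(A)} = \tfrac{1}{2}\left(W_{ij} - W_{ji}\right) \tag{2} $$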

Then we have the following 

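Theorem [15] If $a_i(x_i)/d_i'(x_i) > 0$ $\forall x, i$ and the matrix $W^{(A)}$ has an inverse
then the vector field $\dot{x}_i$ in Eq. (1) decomposes into one gradient and one
Hamiltonian component, $\dot{x}_i = \dot{x}_i^{(G)} + \dot{x}_i^{(H)}$, where

$$ \dot{x}_i^{(G)} = -\frac{a_i(x_i)}{d_i'(x_i)} \frac{\partial V^{(S)}}{\partial x_i} = -\sum_j g^{ij}(x) \frac{\partial V^{(S)}}{\partial x_j}, \qquad \dot{x}_i^{(H)} = -\sum_j a_i(x_i) W_{ij}^{(A)} a_j(x_j) \frac{\partial H}{\partial x_j} = \sum_j \Gamma^{ij}(x) \frac{\partial H}{\partial x_j} \tag{3} $$

and

$$ V^{(S)} = -\sum_i \int^{x_i} b_i(\xi)\, d_i'(\xi)\, d\xi + \frac{1}{2} \sum_{j,k} W_{jk}^{(S)} d_j(x_j) d_k(x_k), \qquad H = \sum_i \int^{x_i} \frac{d_i(\xi)}{a_i(\xi)}\, d\xi $$

$$ g^{ij}(x) = \frac{a_i(x_i)}{d_i'(x_i)}\, \delta_{ij}, \qquad \omega_{ij}(x) = -a_i(x_i)^{-1} \left( W^{(A)\,-1} \right)_{ij} a_j(x_j)^{-1} \tag{4} $$

($\sum_j \omega_{ij} \Gamma^{jk} = \delta_i^k$). $g^{ij}(x)$ and $\omega_{ij}(x)$ are the components of the Riemannian metric
and the symplectic form.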

Proof: The decomposition follows by direct calculation from (1) and (2).
The conditions on $a_i(x_i)$, $d_i'(x_i)$ and $W^{(A)}$ insure that $g$ is a well defined metric
and $\omega$ is non-degenerate. Indeed let $v$ be a vector such that $\sum_i v^i \omega_{ij} = 0$. Then

$$ 0 = \sum_{i,j} v^i \omega_{ij}\, a_j(x_j) W_{jk}^{(A)} = -\frac{v^k}{a_k(x_k)} $$

would imply $v^k = 0$ $\forall k$. That $\omega$ is a closed form follows from the fact that $\omega_{ij}$
depends only on $x_i$ and $x_j$. □
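The decomposition can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the potential $V^{(S)}$ and Hamiltonian $H$ have the gradients written in the comments (the standard Cohen-Grossberg choice), and uses purely illustrative functions $a_i = 1 + x_i^2$, $b_i = -x_i$, $d_i = \tanh x_i$, chosen so that $a_i/d_i' > 0$. The function name is hypothetical.

```python
import numpy as np

def check_decomposition(n=4, seed=0):
    """Compare the field of Eq. (1) with its gradient + Hamiltonian parts."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    WS, WA = 0.5 * (W + W.T), 0.5 * (W - W.T)   # Eq. (2)
    x = rng.normal(size=n)

    # Illustrative (assumed) node functions with a_i > 0 and d_i' > 0.
    a = 1.0 + x**2
    b = -x
    d = np.tanh(x)
    dp = 1.0 / np.cosh(x) ** 2

    # Full vector field, Eq. (1).
    field = a * (b - W @ d)

    # Gradient part: dV/dx_i = -b_i d_i' + d_i' sum_j WS_ij d_j (assumed V).
    grad_V = -b * dp + dp * (WS @ d)
    gradient_part = -(a / dp) * grad_V

    # Hamiltonian part: dH/dx_j = d_j / a_j (assumed H),
    # Gamma_ij = -a_i WA_ij a_j is antisymmetric.
    grad_H = d / a
    Gamma = -(a[:, None] * WA * a[None, :])
    hamiltonian_part = Gamma @ grad_H

    return field, gradient_part + hamiltonian_part, Gamma
```

The two returned vectors agree to rounding error for any choice of $W$. Note that an antisymmetric matrix of odd dimension is singular, so the invertibility condition on $W^{(A)}$ can only hold for an even number of nodes.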

The identification, in the system (1), of just one gradient and one Hamiltonian
component with explicitly known potential and Hamiltonian functions, is a
considerable simplification as compared to a generic dynamical system. Recall
that in the general case, although such a decomposition is possible locally [16],
explicit functions are not easy to obtain unless one allows for one gradient and
$n - 1$ Hamiltonian components. Notice that the decomposition of the vector
field does not decouple the dynamical evolution of the components. In fact it is
the interplay of the dissipative (gradient) and the Hamiltonian component that
leads, for example, to limit cycle behavior.

For the case of symmetric connections $W_{ij} = W_{ji}$ one recovers the
Cohen-Grossberg result [14] that states that a symmetric system of the type of Eq. (1)
has a Lyapunov function $V^{(S)}$, of which Hopfield's [17] "energy" function is a
particular case. For the symmetric case the existence of a Lyapunov function
guarantees global asymptotic stability of the dynamics. However not all vector
fields with a Lyapunov function are differentially equivalent to a gradient field.
Therefore the fact that a gradient vector field is actually obtained gives additional
information, namely about structural stability of the model.

A necessary condition for structural stability of the gradient vector field is
the non-degeneracy of the critical points of $V^{(S)}$, namely
$\det \left\| \partial^2 V^{(S)} / \partial x_i \partial x_j \right\| \neq 0$ at
the points where $\partial V^{(S)} / \partial x_i = 0$. In a gradient flow all orbits approach the critical
points as $t \to \infty$. If the critical points are non-degenerate then the gradient
flow satisfies the conditions defining a Morse-Smale field, except perhaps
the transversality conditions for stable and unstable manifolds of the critical
points. However because Morse-Smale fields are open and dense in the set
of gradient vector fields, any gradient flow with non-degenerate critical points
may always be $C^1$-approximated by a (structurally stable) Morse-Smale gradient
field. Therefore given a symmetric model of the type (1), the identification
of its gradient nature provides an easy way to check its robustness as a physical
model.
As an example of the decomposition applied to a biological model consider
the Wilson-Cowan model of a neural oscillator without refractory periods, in
the antisymmetric coupling case considered by most authors

$$ \dot{x}_1 = -x_1 + S(p_1 + W_{11} x_1 + W_{12} x_2), \qquad \dot{x}_2 = -x_2 + S(p_2 + W_{21} x_1 + W_{22} x_2) \tag{5} $$

with $W_{12} = -W_{21}$, $S$ being the sigmoid function $S(x) = (1 + e^{-x})^{-1}$. Changing
variables to

$$ z_i = p_i + \sum_{j=1}^{2} W_{ij} x_j \tag{6} $$

one obtains

$$ V = \sum_i \left\{ \tfrac{1}{2} z_i^2 - p_i z_i + W_{ii} \log\left(1 - S(z_i)\right) \right\}, \qquad H = \sum_i \log\left(1 - S(z_i)\right) $$

The model is completely described by these functions, the bifurcation sets [18],
for example, being characterized by $\Delta V = 0$ for Andronov-Hopf bifurcations
and by

$$ \frac{\partial^2 V}{\partial z_1^2} \frac{\partial^2 V}{\partial z_2^2} + W_{12}^2 \frac{\partial^2 H}{\partial z_1^2} \frac{\partial^2 H}{\partial z_2^2} = 0 $$

for saddle-node bifurcations.
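As a hedged illustration, the sketch below checks numerically that the potential and Hamiltonian of the Wilson-Cowan example reproduce the dynamics in the $z$ variables, $\dot{z}_i = -(z_i - p_i) + \sum_j W_{ij} S(z_j)$. It assumes the logistic sigmoid $S(x) = (1 + e^{-x})^{-1}$ and a unit metric for the gradient part; the function names and parameter values are illustrative only.

```python
import numpy as np

def S(z):
    """Logistic sigmoid (assumed form of S)."""
    return 1.0 / (1.0 + np.exp(-z))

def z_field(z, p, W):
    """dz/dt obtained from Eq. (5) through the change of variables (6)."""
    return -(z - p) + W @ S(z)

def decomposed_field(z, p, W):
    """Gradient part -dV/dz plus Hamiltonian part -W^(A) dH/dz."""
    grad_V = z - p - np.diag(W) * S(z)   # uses d/dz log(1 - S) = -S
    grad_H = -S(z)
    WA = 0.5 * (W - W.T)
    return -grad_V - WA @ grad_H

# Illustrative parameters with W12 = -W21.
W = np.array([[2.0, -1.5], [1.5, 0.5]])
p = np.array([0.1, -0.2])
```

With these definitions, `z_field` and `decomposed_field` coincide for any $z$, and the trace of the Jacobian equals $-\Delta V$, which is why $\Delta V = 0$ marks the Andronov-Hopf set.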



2 Unsupervised learning in generalized networks 
and the processing of non-Gaussian signals 

2.1 Unsupervised learning in general networks

Doyne Farmer [19] has shown that there is a common mathematical framework
where neural networks, classifier systems, immune networks and autocatalytic
reaction networks may be treated in a unified way. The general model into which
all these models may be mapped looks like a neural network where, in addition
to the node state variables ($x_i$) and the connection strengths ($W_{ij}$), there is also
a node parameter ($\theta_i$) with learning capabilities (Fig. 1). The node parameter
represents the possibility of changing, through learning, the nature of the linear
or non-linear function $f_i(\sum_j W_{ij} x_j)$ at each node. In the simplest case $\theta_i$ will
be simply an intensity parameter. Therefore the degree to which the activity at
node $i$ influences the activity at other nodes depends not only on the connection
strengths ($W_{ij}$) but also on an adaptive node parameter $\theta_i$. In some cases, as
in the B-cell immune network, the node parameter is the only means to control
the relative influence of a node on others, the connection strengths being fixed
chemical reaction rates.

Figure 1: A general connectionist network with state variables ($x_i$), connection
strengths ($W_{ij}$) and node parameters ($\theta_i$)



2.1.1 Hebbian - type learning with a node parameter 

We will denote by $x_i$ the output of node $i$. Hebbian learning [20] is a type of
unsupervised learning where a connection strength $W_{ij}$ is reinforced whenever the
product $x_i x_j$ is large. As shown by several authors, Hebbian learning extracts
the eigenvectors of the correlation matrix $Q$ of the input data,

$$ Q_{ij} = \langle x_i x_j \rangle \tag{7} $$

where $\langle \ldots \rangle$ means the sample average. If the learning law is local, the rows of
the connection matrix $W_{ij}$ all tend to the eigenvector associated to the largest
eigenvalue of the correlation matrix. To obtain the other eigenvector directions
one needs non-local laws [22] [23]. Sanger's approach has the advantage of
organizing the connection matrix in such a way that the rows are the eigenvectors
associated to the eigenvalues in decreasing order. It suffers however from
slow convergence rates for the lowest eigenvalues. The methods that have been
proposed may, with small modifications, be used both for linear and non-linear
networks. However, because the maximum information about a signal $\{x_i\}$
that may be coded directly in the connection matrix $W_{ij}$ is the principal
components decomposition, and this may already be obtained with linear units, we
will discuss only this case.
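To make the role of the non-local term concrete, here is a minimal sketch of Sanger's rule in its averaged (deterministic) form, where the sample average $\langle x x^T \rangle$ is replaced by a given correlation matrix $Q$. The function name, learning rate and number of steps are illustrative assumptions, not values from the text.

```python
import numpy as np

def sanger_averaged(Q, steps=5000, lr=0.02, seed=0):
    """Averaged Sanger update: dW = lr * (W Q - LT[W Q W^T] W).

    LT[.] keeps the lower triangle (including the diagonal); this is the
    non-local term that forces row i away from the eigenvectors already
    captured by rows k < i.
    """
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    W = 0.1 * rng.normal(size=(n, n))
    for _ in range(steps):
        C = W @ Q @ W.T                 # <y y^T> for the linear output y = W x
        W += lr * (W @ Q - np.tril(C) @ W)
    return W
```

Dropping the off-diagonal part of `np.tril(C)` leaves a purely local rule, and then every row converges to the leading eigenvector, as stated above.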

The learning rules proposed below [23] are a generalization of Sanger's scheme
including a node parameter $\theta_i$. We consider a one-layer feedforward network
with as many inputs as outputs (Fig. 2) and the updating rules proposed for
$W_{ij}$ and $\theta_i$ are:

$$ W_{ij}(t+1) = W_{ij}(t) + \gamma_W\, y_i(t) \left\{ x_j(t) - \sum_{k=1}^{i} \theta_k^{-1} y_k(t) W_{kj}(t) \right\} $$
$$ \theta_i(t+1) = \theta_i(t) + \gamma_\theta\, y_i(t) \left\{ 1 - y_i(t) \right\} \tag{8} $$

where $y_i$ is the output of node $i$

$$ y_i = \theta_i \sum_j W_{ij} x_j \tag{9} $$

and $\gamma_W$ and $\gamma_\theta$ are positive constants that control the learning rate. As will be
shown below the learning dynamics of Eqs. (8) has the capability to accelerate
the convergence rate for the small eigenvalues of $Q$. To avoid undesirable fixed
points of the dynamics, where this acceleration effect is not obtained, the learning
rules (8) are supplemented by the following prescription: "The starting $\theta_i$'s
are all positive and are not allowed to decrease below a value $(\theta_i)_{min}$. If, at a
certain point of the learning process, $\theta_i$ hits the lower bound then one makes the
replacement $W_{ij} \to -W_{ij}$ in the line $i$ of the connection matrix" (see remark
3).

Figure 2: One-layer neural feedforward network for Hebbian learning with node
parameter
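One update step of the learning laws with a node parameter can be sketched as follows. This is only an illustration under stated assumptions: the $W$ rule is Sanger's rule with $y_i$ as prefactor, the $\theta$ rule is $\theta_i + \gamma_\theta y_i (1 - y_i)$, and the sign-flip prescription is applied when $\theta_i$ falls below a lower bound. The function name and the default value of `theta_min` are hypothetical.

```python
import numpy as np

def node_param_step(W, theta, x, gamma_w=0.015, gamma_theta=0.005, theta_min=1e-3):
    """One step of the W and theta updates for a single input vector x."""
    u = W @ x                      # u_k = sum_j W_kj x_j = y_k / theta_k
    y = theta * u                  # node outputs, Eq. (9)
    W_new = W.copy()
    for i in range(W.shape[0]):
        # sum over k <= i of theta_k^{-1} y_k W_kj (non-local Sanger term)
        recon = u[: i + 1] @ W[: i + 1]
        W_new[i] += gamma_w * y[i] * (x - recon)
    theta_new = theta + gamma_theta * y * (1.0 - y)
    # Prescription: keep theta at the lower bound and flip the sign of row i.
    hit = theta_new < theta_min
    theta_new[hit] = theta_min
    W_new[hit] *= -1.0
    return W_new, theta_new
```

Since $y_i = \theta_i \sum_j W_{ij} x_j$, the prefactor of the $W$ update is effectively $\gamma_W \theta_i$, so a growing $\theta_i$ acts as a node-specific learning rate.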

Define the N-dimensional vectors

$$ (\mathbf{x})_i = x_i, \qquad (\mathbf{W}_i)_j = W_{ij} \tag{10} $$

Then the following result was proven [23]:

If the time scale of the system (8) is much slower than the averaging time of
the input signal $\{x_i\}$ then:

a) the system (8) has a stable fixed point such that

$$ Q \mathbf{W}_i = \lambda_i \mathbf{W}_i, \qquad \theta_i = \frac{1}{\lambda_i} \left( \mathbf{W}_i \cdot \langle \mathbf{x} \rangle \right) \tag{11} $$

and $|\mathbf{W}_i| = 1$. $\lambda_i > 0$ ($i = 1, \ldots, N$) are the eigenvalues of the correlation
matrix.

b) Convergence to the fixed point is sequential in the sense that $\mathbf{W}_j$ is only
attracted to the eigenvector described in (11) if all vectors $\mathbf{W}_i$ for $i < j$ are
already close to their corresponding eigenvector values.
Remarks: 

1 - The matrix $W$ of the connection strengths extracts the principal components
of the correlation matrix $Q$. The node parameters $\theta_i$ at their fixed
point (11) extract additional information on the mean value of the data vector
$\mathbf{x}$ and the eigenvalues of $Q$. To deal with data with zero mean it is convenient
to change the $\theta$-updating law to

$$ \theta_i(t+1) = \theta_i(t) + \gamma_\theta \left\{ \theta_i(t) \sum_k W_{ik} (x_k + r_k) - \theta_i^2(t) \Big( \sum_k W_{ik} x_k \Big)^2 \right\} \tag{12} $$

where $\mathbf{r}$ is a fixed vector.
The stable fixed point is now

$$ \theta_i = \frac{1}{\lambda_i} \left( \mathbf{W}_i \cdot (\langle \mathbf{x} \rangle + \mathbf{r}) \right) \tag{13} $$

If one wishes to separate the eigenvalues from the information on the average
data $\langle \mathbf{x} \rangle$ one may add another parameter $\mu_i$ to each node with a learning
law

$$ \mu_i(t+1) = \mu_i(t) + \gamma_\mu \left\{ 1 - \mu_i^\alpha(t) \Big( \sum_k W_{ik} x_k \Big)^2 \right\} \tag{14} $$

which, with the same assumptions about time scales as before, converges to
the stable fixed point

$$ \mu_i = \left( \frac{1}{\lambda_i} \right)^{1/\alpha} \tag{15} $$

The convergence rate near the fixed point is $\gamma_\mu \alpha \lambda_i^{1/\alpha}$. Therefore choosing a
large $\alpha$ accelerates convergence for the small eigenvalues.
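The quoted convergence rate can be recovered by linearizing the averaged form of the $\mu_i$ learning law about its fixed point. The short derivation below is a sketch that assumes the row $\mathbf{W}_i$ has already converged to a unit eigenvector, so that $\langle (\sum_k W_{ik} x_k)^2 \rangle = \mathbf{W}_i \cdot Q \mathbf{W}_i = \lambda_i$.

```latex
\begin{align*}
\mu_i(t+1) &= \mu_i(t) + \gamma_\mu \left\{ 1 - \lambda_i\, \mu_i^{\alpha}(t) \right\}
  && \text{(averaged law)} \\
\mu_i^{*} &= \lambda_i^{-1/\alpha}
  && \text{(fixed point)} \\
\delta_i(t+1) &= \left( 1 - \gamma_\mu\, \alpha\, \lambda_i\, (\mu_i^{*})^{\alpha-1} \right) \delta_i(t)
  = \left( 1 - \gamma_\mu\, \alpha\, \lambda_i^{1/\alpha} \right) \delta_i(t),
  && \delta_i = \mu_i - \mu_i^{*}
\end{align*}
```

For $\lambda_i < 1$ the factor $\lambda_i^{1/\alpha}$ approaches one as $\alpha$ grows, so the contraction rate $\gamma_\mu \alpha \lambda_i^{1/\alpha}$ no longer vanishes with the eigenvalue.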

2 - In addition to their role in extracting additional information on the input
signal, the node parameters $\theta_i$ also play a role in accelerating the convergence to
the stable fixed point. For fixed $\gamma_W \theta_i$, the rate of convergence to the fixed point is
very slow for the components associated to the smallest eigenvalues. This is the
reason for the convergence problems in Sanger's method and one also finds that
sometimes the results for the small components are quite misleading. On the
other hand increasing $\gamma_W$ does not help, because then the time scale of the
learning law becomes of the order of the time scale of the data and one obtains
large fluctuations in the principal components. With a node parameter and the
learning law (8) the situation is more favorable because for small eigenvalues
the effective control parameter ($\gamma_W \theta_i$) is dynamically amplified. This accelerates
the convergence of the minor components without inducing fluctuations on the
principal components.

3 - Here one examines the effect of the prescription to avoid the fixed points
where some $\theta_i = 0$. Consider the average evolution of $\theta_i$, assuming all the other
variables fixed

$$ \theta_i(t+1) = \theta_i(t) \left( 1 + \gamma_\theta\, \mathbf{W}_i \cdot \langle \mathbf{x} \rangle \right) - \gamma_\theta\, \theta_i^2(t) \left( \mathbf{W}_i \cdot Q \mathbf{W}_i \right) $$

If $\mathbf{W}_i \cdot \langle \mathbf{x} \rangle > 0$ the stable fixed point is at $\mathbf{W}_i \cdot \langle \mathbf{x} \rangle / (\mathbf{W}_i \cdot Q \mathbf{W}_i)$ and if
$\mathbf{W}_i \cdot \langle \mathbf{x} \rangle < 0$ it is at zero (see Fig. 3). If $\mathbf{W}_i \cdot \langle \mathbf{x} \rangle$ is $< 0$, $\theta_i$ moves towards
the fixed point F0 at zero but, when it reaches $(\theta_i)_{min}$, the change in the sign
of the corresponding row $\mathbf{W}_i$ in the connection matrix changes the dynamics and it
is F1 that is now the attracting fixed point.

Figure 3: The effect of changing the sign of $\mathbf{W}_i$ in the approach to the stable
fixed points

To illustrate the effect of node parameters in principal component analysis
(PCA), consider the following two-dimensional signal $\mathbf{x}$, where the $t_i$ are Gaussian
distributed variables with zero mean and unit variance:

$$ x_1 = t_1; \qquad x_2 = t_1 + 0.08\, t_2 $$

The principal components of the signal and the eigenvalues are given in the
following table.

lambda_i    W_i1      W_i2
2.0032      0.7060    0.7082
0.0032      0.7082    -0.7060
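The entries of the table can be reproduced directly from the definition of the signal. The sketch below builds the exact correlation matrix $Q = A A^T$ for $\mathbf{x} = A \mathbf{t}$ with independent unit-variance $t_1$, $t_2$, and diagonalizes it; the variable names are illustrative.

```python
import numpy as np

# x1 = t1, x2 = t1 + 0.08 t2, written as x = A t
A = np.array([[1.0, 0.0],
              [1.0, 0.08]])
Q = A @ A.T                        # exact correlation matrix <x x^T>

lam, vecs = np.linalg.eigh(Q)      # eigenvalues in ascending order
lam, vecs = lam[::-1], vecs[:, ::-1]   # reorder: largest eigenvalue first

for value, vector in zip(lam, vecs.T):
    print(f"{value:.4f}", np.round(np.abs(vector), 4))
```

The ratio of the two eigenvalues is about 600, which is what makes the minor component slow to extract with a fixed learning rate.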
A one-layer network with node parameters as in Fig. 2 is used to perform
PCA. Fig. 4 shows the data and the principal directions that are obtained using
the learning laws (8) and (12). The parameter values used are $\gamma_W = 0.015$,
$\gamma_\theta = 0.005$ and $\mathbf{r} = (0.002, 0.002)$. Fig. 5 shows the convergence of the $W$'s to
their final values in the learning process. Fig. 6 shows the variation of the node
parameters.

Figure 4: Signal distribution and its principal directions

Sanger's original algorithm is recovered by fixing the node parameters to
unit values. In this case the principal components are also extracted, but the
convergence of the process is much slower as shown in Fig. 7. With node
parameters, improved convergence of the small eigenvalues is to be expected under
generic conditions because the rates of convergence are controlled by $\gamma_W \theta_i$ and
$\theta_i$ is proportional to $1/\lambda_i$. In the example the effect of $\theta_2$ is further amplified
by the fact that, before $\mathbf{W}_2$ starts to converge, $\mathbf{W}_2 \cdot \mathbf{r}$ is large. Hence, in this
case at least, the node parameter acts like a variable learning rate for the minor
component. Notice that the method does not induce oscillations in the
major components as would happen with large $\gamma_W$ values. Besides speeding up
the learning process the node parameters also contain information about first
moments and the eigenvalues.

Node parameters also have a beneficial effect on competitive learning algorithms.
For details refer to [23].

Figure 5: Evolution of $\mathbf{W}_1$ and $\mathbf{W}_2$



2.2 Non-Gaussian data and the neural computation of the 
characteristic function 

The aim of principal component analysis (PCA) is to extract the eigenvectors
of the correlation matrix from the data. There are standard neural network
algorithms for this purpose [21] [22] [23] [24]. However, if the process is
non-Gaussian, PCA algorithms or their higher-order generalizations provide only
incomplete or misleading information on the statistical properties of the data.

Let $x_i$ denote the output of node $i$ in a neural network. Hebbian learning [20]
is a type of unsupervised learning where the neural network connection strengths
$W_{ij}$ are reinforced whenever the products $x_i x_j$ are large. The simplest form is

$$ \Delta W_{ij} = \eta\, x_i x_j \tag{16} $$

Hebbian learning extracts the eigenvectors of the correlation matrix $Q$

$$ Q_{ij} = \langle x_i x_j \rangle \tag{17} $$

but, if the learning law is local as in Eq. (16), all the rows of the connection
matrix $W_{ij}$ converge to the eigenvector with the largest eigenvalue of the
correlation matrix. To obtain other eigenvector directions requires non-local laws.
These principal component analysis (PCA) algorithms find the characteristic
directions of the correlation matrix $(Q)_{ij} = \langle x_i x_j \rangle$. If the data has zero
mean ($\langle x_i \rangle = 0$) they are the orthogonal directions along which the data has
maximum variance. If the data is Gaussian in each channel, it is distributed as
a hyperellipsoid and the correlation matrix $Q$ already contains all the information
about statistical properties. This is because higher order moments of the
data may be obtained from the second order moments. However, if the data is
non-Gaussian, the PCA analysis is not complete and higher order correlations
are needed to characterize the statistical properties. This led some authors [25]
[26] to propose networks with higher order neurons to obtain the higher order
statistical correlations of the data. A higher order neuron is one that is capable
of accepting, in each of its input lines, data from two or more channels at once.
There is then a set of adjustable strengths $W_{ij_1}, W_{ij_1 j_2}, \ldots, W_{ij_1 \ldots j_n}$, $n$ being the
order of the neuron. Networks with higher order neurons have interesting
applications, for example in fitting data to a high-dimensional hypersurface. However
there is a basic weakness in the characterization of the statistical properties of
non-Gaussian data by higher order moments. Existence of the moments of a
distribution function depends on the behavior of this function at infinity and it
frequently happens that a distribution has moments up to a certain order, but
no higher ones. A well-behaved probability distribution might even have no
moments of order higher than one (the mean). In addition a sequence of moments
does not necessarily determine a probability distribution function uniquely [27].
Two different distributions may have the same set of moments. Therefore, for
non-Gaussian data, the PCA algorithms or higher order generalizations may
lead to misleading results.

Figure 6: Evolution of $\theta_1$ and $\theta_2$

Figure 7: Evolution of $\mathbf{W}_1$ and $\mathbf{W}_2$ using Sanger's algorithm, without node
parameters

As an example consider the two-dimensional signal shown in Fig. 8. Fig. 9
shows the evolution of the connection strengths $W_{11}$ and $W_{12}$ when this signal is
passed through a typical PCA algorithm. Large oscillations appear and finally
the algorithm overflows. Smaller learning rates do not introduce qualitative
modifications in this evolution. The values may at times appear to stabilize,
but large spikes do occur. The reason is that the seemingly harmless data in
Fig. 8 is generated by a linear combination of a Gaussian variable with a variable
that has the following distribution

$$ p(x) = k\, (2 + x^2)^{-3/2} \tag{18} $$

which has a first moment, but no moments of higher order.

To be concerned with non-Gaussian processes is not a purely academic exercise,
because in many applications adequate tools are needed to analyze such
processes. For example, processes without higher order moments, in particular
those associated with Lévy statistics, are prominent in complex processes such
as relaxation in glassy materials, chaotic phase diffusion in Josephson junctions
and turbulent diffusion [28] [29] [30]. Moments of an arbitrary probability
distribution may not exist. However, because every bounded and measurable function
is integrable with respect to any distribution, the existence of the characteristic
function $f(\alpha)$ is always assured [27].

$$ f(\alpha) = \int e^{i \alpha \cdot x}\, dF(x) = \langle e^{i \alpha \cdot x} \rangle \tag{19} $$

$\alpha$ and $x$ are N-dimensional vectors, $x$ is the data vector and $F(x)$ its distribution
function. The characteristic function is a compact and complete characterization
of the probability distribution of the signal. If, in addition, one wishes to
describe the time correlations of the stochastic process $x(t)$, the corresponding
quantity is the characteristic functional [31]

$$ F(\xi) = \int e^{i (x, \xi)}\, d\mu(x) \tag{20} $$

where $\xi(t)$ is a smooth function and the scalar product is

$$ (x, \xi) = \int dt\, x(t)\, \xi(t) \tag{21} $$

$\mu(x)$ being the probability measure over the sample paths of the process.

Figure 8: A two-dimensional test signal
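Because the characteristic function is a sample average of a bounded quantity, it can be estimated directly from data even when moments fail to exist. A minimal sketch for scalar data follows; the function name is illustrative and this is a plain sample average, not the network-based algorithm of the text.

```python
import numpy as np

def empirical_cf(samples, alphas):
    """Sample estimate of f(alpha) = <exp(i alpha x)> for scalar data."""
    samples = np.asarray(samples, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    phase = np.outer(alphas, samples)        # shape (len(alphas), n_samples)
    return np.exp(1j * phase).mean(axis=1)

# Usage: Gaussian data, and heavy-tailed data without a second moment
# (Student t with 2 degrees of freedom, whose density is (2 + x^2)^(-3/2)).
rng = np.random.default_rng(0)
alphas = np.linspace(-3.0, 3.0, 13)
cf_gauss = empirical_cf(rng.normal(size=100_000), alphas)
cf_heavy = empirical_cf(rng.standard_t(2, size=100_000), alphas)
```

For the Gaussian sample the estimate should be close to $e^{-\alpha^2/2}$; for the heavy-tailed sample the estimate remains bounded by one in modulus although the sample variance does not settle.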

In the following I will describe an algorithm to compute the characteristic
function from the data, by a learning process [32]. The main idea is that, at the
end of the learning process, one has a neural network which is a representation
of the characteristic function. This network is then available to provide all the
required information on the probability distribution of the data being analyzed.

Figure 9: Evolution of the connection strengths $W_{11}$ and $W_{12}$ in a PCA network
for the data in Fig. 8

Suppose we want to learn the characteristic function $f(\alpha)$ of a one-dimensional
signal $x(t)$ in a domain $\alpha \in [\alpha_0, \alpha_N]$. The $\alpha$-domain is divided in $N$ intervals by
a sequence of values $\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_N$ and a network is constructed with $N + 1$
intermediate layer nodes and an output node (Fig. 10).

The learning parameters in the network are the connection strengths $W_{0i}$ and
the node parameters $\theta_i$. The existence of the node parameter means that the
output of a node in the intermediate layer is $\theta_i \chi_i(\alpha)$, $\chi_i$ being a non-linear
function. The use of both connection strengths and node parameters in neural
networks makes them equivalent to a wide range of other connectionist systems [19]
and improves their performance in standard applications [24]. The learning laws
for the network in Fig. 10 are:

$$ \theta_i(t+1) = \theta_i(t) + \gamma \left( \cos \alpha_i x(t) - \theta_i(t) \right) $$
$$ W_{0i}(t+1) = W_{0i}(t) + \eta \sum_j \Big( \theta_j(t) - \sum_k W_{0k}(t) \chi_k(\alpha_j) \theta_k(t) \Big) \theta_i(t) \chi_i(\alpha_j) \tag{22} $$

$\gamma, \eta > 0$. The intermediate layer nodes are equipped with a radial basis function

$$ \chi_i(\alpha) = \frac{e^{-(\alpha - \alpha_i)^2 / 2\sigma_i^2}}{\sum_{k=0}^{N} e^{-(\alpha - \alpha_k)^2 / 2\sigma_k^2}} \tag{23} $$

where in general one uses $\sigma_i = \sigma$ for all $i$. The output is a simple additive
node. The learning constant $\gamma$ should be sufficiently small to insure that the
learning time is much larger than the characteristic times of the data $x(t)$. If
this condition is satisfied each node parameter $\theta_i$ tends to $\langle \cos \alpha_i x \rangle$, the real
part of the characteristic function $f(\alpha)$ for $\alpha = \alpha_i$. The $W_{0i}$ learning law was
chosen to minimize the error function

$$ f(W) = \frac{1}{2} \sum_j \Big( \theta_j - \sum_k W_{0k}(t) \chi_k(\alpha_j) \theta_k \Big)^2 \tag{24} $$

One sees that the learning scheme is a hybrid one, in the sense that the node
parameter $\theta_i$ learns, in an unsupervised way, (the real part of) the characteristic
function $f(\alpha_i)$ and then, by a supervised learning scheme, the $W_{0i}$'s are adjusted
to reproduce the $\theta_i$ value in the output whenever the input is $\alpha_i$. Through the
learning law (22) each node parameter $\theta_i$ converges to $\langle \cos \alpha_i x \rangle$ and the
interpolating nature of the radial basis functions guarantees that, after training,
the network will approximate the real part of the characteristic function for any
$\alpha$ in the domain $[\alpha_0, \alpha_N]$. A similar network is constructed for the imaginary
part of the characteristic function, where now

$$ \theta_i'(t+1) = \theta_i'(t) + \gamma \left( \sin \alpha_i x(t) - \theta_i'(t) \right) \tag{25} $$

Figure 10: Network to learn the characteristic function of a scalar process
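The hybrid scheme for the real part can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the unsupervised pass is run over the whole data stream first, the supervised weights are then adjusted by repeated gradient steps on the quadratic error, and the width $\sigma$ defaults to half the grid spacing. The function name and all default values are hypothetical.

```python
import numpy as np

def learn_cf_real(x, alphas, gamma=0.001, eta=0.5, sweeps=200, sigma=None):
    """Learn theta_i ~ <cos(alpha_i x)> and output weights W_0i."""
    alphas = np.asarray(alphas, dtype=float)
    if sigma is None:
        sigma = 0.5 * (alphas[1] - alphas[0])   # assumed common width
    theta = np.zeros_like(alphas)
    for xt in x:                                 # unsupervised node-parameter law
        theta += gamma * (np.cos(alphas * xt) - theta)

    # chi[j, k] = chi_k(alpha_j): normalized radial basis functions
    G = np.exp(-(alphas[:, None] - alphas[None, :]) ** 2 / (2.0 * sigma**2))
    chi = G / G.sum(axis=1, keepdims=True)
    A = chi * theta[None, :]                     # A[j, k] = chi_k(alpha_j) theta_k

    W = np.zeros_like(alphas)
    errors = []
    for _ in range(sweeps):                      # supervised law: gradient descent
        resid = theta - A @ W                    # theta_j - network output at alpha_j
        errors.append(0.5 * np.sum(resid**2))
        W += eta * (A.T @ resid)
    return theta, W, errors
```

The $\theta$ update is an exponential moving average, which is why a small $\gamma$ (long averaging window) is needed for $\theta_i$ to settle on the sample mean of $\cos \alpha_i x$. Replacing `np.cos` by `np.sin` gives the network for the imaginary part.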

For higher dimensional data the scheme is similar. The number of required nodes
is $N^d$ for a $d$-dimensional data vector $\vec{x}(t)$. For example for the 2-dimensional
data of Fig. 8 a set of $N^2$ nodes was used (Fig. 11). Each node in the square
lattice has two inputs for the two components $\alpha_1$ and $\alpha_2$ of the vector argument
of $f(\vec{\alpha})$. The learning laws are, as before

$$ \theta_{(ij)}(t+1) = \theta_{(ij)}(t) + \gamma \left( \cos \vec{\alpha}_{(ij)} \cdot \vec{x}(t) - \theta_{(ij)}(t) \right) $$
$$ W_{0(ij)}(t+1) = W_{0(ij)}(t) + \eta \sum_{(kl)} \Big( \theta_{(kl)}(t) - \sum_{(mn)} W_{0(mn)}(t) \chi_{(mn)}(\vec{\alpha}_{(kl)}) \theta_{(mn)}(t) \Big) \theta_{(ij)}(t) \chi_{(ij)}(\vec{\alpha}_{(kl)}) \tag{26} $$

The pair $(ij)$ denotes the position of the node in the square lattice and the
radial basis function is

$$ \chi_{(ij)}(\vec{\alpha}) = \frac{e^{-|\vec{\alpha} - \vec{\alpha}_{(ij)}|^2 / 2\sigma_{(ij)}^2}}{\sum_{(kl)} e^{-|\vec{\alpha} - \vec{\alpha}_{(kl)}|^2 / 2\sigma_{(kl)}^2}} \tag{27} $$

Two networks are used, one for the real part of the characteristic function, and
another for the imaginary part with, in Eqs. (26), $\cos \vec{\alpha}_{(ij)} \cdot \vec{x}(t)$ replaced by
$\sin \vec{\alpha}_{(ij)} \cdot \vec{x}(t)$. Figs. 12 and 13 show the values computed by the algorithm for
the real and imaginary parts of the characteristic function corresponding to the
two-dimensional signal in Fig. 8.

Figure 11: Network to learn the characteristic function of a 2-dimensional signal
$\vec{x}(t)$
Figure 12: Real part of the characteristic function for the data in Fig. 8 (left)
and the mesh of $\theta_i$ values (right) obtained by the network

On the left is a plot of the exact characteristic function and on the right the
values learned by the network. In this case we show only the mesh corresponding
to the $\theta_i$ values. One obtains a 2.0% accuracy for the real part and 4.5% accuracy
for the imaginary part. The convergence of the learning process is fast and the
approximation is reasonably good. Notice in particular the slope discontinuity
at the origin which reveals the non-existence of a second moment.

For a second example the data was generated by a Weierstrass random walk 
with probability distribution 



and b=1.31, which is a process of the Levy flight type. The characteristic 
function, obtained by the network, is shown in Fig. 14. 

Taking the log(-log) of the network output one obtains the scaling exponent
1.49 near $\alpha = 0$, close to the expected fractal dimension of the random walk
path (1.5).
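The log(-log) analysis amounts to a straight-line fit. As a sketch, assume that near the origin the characteristic function behaves as $f(\alpha) \approx e^{-c |\alpha|^{\mu}}$ (an assumption made here for illustration); then $\log(-\log f) = \log c + \mu \log \alpha$ and the scaling exponent $\mu$ is the slope. The function name is hypothetical.

```python
import numpy as np

def scaling_exponent(alphas, cf_values):
    """Slope of log(-log f) versus log(alpha), using points with 0 < f < 1."""
    alphas = np.asarray(alphas, dtype=float)
    f = np.asarray(cf_values, dtype=float)
    mask = (alphas > 0) & (f > 0) & (f < 1)
    slope, _intercept = np.polyfit(
        np.log(alphas[mask]), np.log(-np.log(f[mask])), 1
    )
    return slope

# Usage on synthetic values with a known exponent of 1.5
alphas = np.linspace(0.01, 0.5, 50)
mu = scaling_exponent(alphas, np.exp(-0.7 * alphas**1.5))
```

In practice the fit should be restricted to a small neighbourhood of $\alpha = 0$, where the power-law form is expected to hold.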

These examples test the algorithm as a process identifier, in the sense that, 
after the learning process, the network is a dynamical representation of the 
characteristic function and may be used to perform all kinds of analysis of the 
statistics of the data. 




(28) 
Figure 13: Imaginary part of the characteristic function for the data in Fig. 8
(left) and the mesh of $\theta_i$ values (right) obtained by the network



3 Chaotic networks for information processing 

Freeman and collaborators [33] [34] [35] [36] [37] have extensively studied
and modeled the neural activity in the mammalian olfactory system. Their
conclusions challenge the idea that pattern recognition in the brain is accomplished
as in an attractor neural network [38]. Pattern recognition in the brain is
the process by which external signals arriving at the sense organs are converted
into internal meaningful states. The studies of the excitation patterns in the
olfactory bulb of the rabbit lead to the conclusion that, at least in this biological
pattern recognition system, there is no evolution towards an equilibrium fixed
point, nor does the system seem to be minimizing an energy function. Other interesting
conclusions of these biological studies are:

- the main component of the neural activity in the olfactory system is chaotic. 
This is also true in other parts of the brain, periodic behavior occurring only in 
abnormal situations like deep anesthesia, coma, epileptic seizures or in areas of 
the cortex that have been isolated from the rest of the brain; 

- the low-level chaos that exists in absence of an external stimulus is, in the 
presence of a signal, replaced by bursts lasting for about 100 ms which have 
different intensities in different regions of the olfactory bulb. Olfactory pattern 
recognition manifests itself as a spatially coherent pattern of intensity; 

- the recognition time is very fast, in the sense that the transition between 
different patterns occurs in times as short as 6 ms. Given the neuron charac- 
teristic response times this is clearly incompatible with the global approach to 
equilibrium of an attractor neural network; 

- the biological measurements that have been performed do not record the 
action potential of individual neurons, but the local effect of the currents coming 
out of thousands of cells. Therefore the very existence of measurable activity 
bursts implies a synchronization of local assemblies of many neurons. 
Figure 14: Characteristic function for the Weierstrass random walk (b=1.31)



Freeman, Yao and Burke [35] [36] model the olfactory system with a set of
non-linear coupled differential equations, the coupling being adjusted by means
of an input correlation learning scheme. Each variable in the coupled system is
assumed to represent the dynamical state of a local assembly of many neurons.
Based on numerical simulations they conjecture that olfactory pattern recognition
is realized through a multilobe strange attractor. The system would be,
most of the time, in a basal (low-activity) state, being excited to one of the
higher lobes by the external stimulus.

To compute or even prove the existence of chaotic measures in coupled
differential equation systems is a formidable task. Therefore, even if it may be
biologically accurate, the analytical model of these authors is difficult to deal with
and unsuitable for wide application in technological pattern recognition tasks,
although one such application has indeed been attempted by the authors [37].
However the idea that efficient pattern recognition may be achieved by a chaotic
system, which selects distinct invariant measures according to the class of
external stimuli, is quite interesting and deserves further exploration. Inspired by
the biological evidence a model has been developed [39], which behaves roughly
as an olfactory system (in Freeman's sense) and, at the same time, is easier to
describe and control by analytical means. To play the role of the local chaotic
assembly of neurons a Bernoulli unit is chosen. The connection between the
units is realized by linear synapses with an input correlation learning law and
the external inputs also have adjustable gains, changing as in a biological
potentiation mechanism. This last feature turns out to be useful to enhance the
novelty-filter qualities of the system.
Figure 15: log(-log) of the characteristic function for the Weierstrass random
walk (b=1.31)



3.1 A network of Bernoulli units 

The network is the fully connected system shown in Fig. 16. The output of
the nodes is denoted by $y_i$ and the $x_i$'s are the external inputs. $W_{ij}$ with
$i, j \in \{1, 2, \ldots, N\}$ are the connection strengths and $W_{i0}$ the input gains. Both
$W_{ij}$ and $W_{i0}$ are real numbers in the interval $[0, 1]$. The input patterns are
zero-one sequences ($x_i \in \{0, 1\}$). The learning laws for the connection strengths
and the input gains are the following:

Let $S_{ij}(t) = W_{ij}(t) + \eta\, x_i(t) x_j(t)$. Then for $i \neq j$

$$ W_{ij}(t+1) = S_{ij}(t) \quad \text{if} \quad \sum_{k \neq i} S_{ik}(t) \leq C \tag{29} $$

$$ W_{ij}(t+1) = C\, \frac{S_{ij}(t)}{\sum_{k \neq i} S_{ik}(t)} \quad \text{if} \quad \sum_{k \neq i} S_{ik}(t) > C \tag{30} $$

for $i = j$

$$ W_{ii}(t+1) = 1 - \sum_{k \neq i} W_{ik}(t+1) \tag{31} $$

and for the input gains

$$ W_{i0}(t+1) = \alpha\, \frac{N_i(1)_t}{N_i(1)_t + N_i(0)_t} \tag{32} $$

Figure 16: The Bernoulli network



According to Eqs. (29)-(32), when an input pattern has a one in both the $i$
and $j$ positions, the correlation of the units $i$ and $j$ becomes stronger. $C < 1$
is a constant related to the node dynamics, which the sum of the off-diagonal
connections is not allowed to exceed. $\eta$ is a small parameter that controls the
learning speed. Finally, the diagonal element $W_{ii}$ is chosen in such a way that
the connections entering each unit sum to one.

In the input gain learning law, $N_i(1)_t$ (or $N_i(0)_t$) is the number of times
that a one (or a zero) has appeared at the input $i$, up to time $t$. Eq. (32) means
that an input that is excited many times during the learning phase becomes
more sensitive.
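The learning laws can be sketched in a few lines of Python. This is a minimal illustration, assuming the reading of Eqs. (29)-(32) in which a row of off-diagonal connections is rescaled to sum to $C$ whenever the updated sum would exceed $C$. The function names `learn_connections` and `input_gains`, and the guard against empty counts, are hypothetical choices made for this example and do not come from the text.

```python
import numpy as np

def learn_connections(W, x, eta, C):
    """One learning step for the connection strengths (assumed form of Eqs. (29)-(31))."""
    x = np.asarray(x, dtype=float)
    S = np.array(W, dtype=float) + eta * np.outer(x, x)  # S_ij = W_ij + eta x_i x_j
    np.fill_diagonal(S, 0.0)               # the diagonal is fixed by normalization below
    row = S.sum(axis=1)                    # sum over k != i of S_ik
    scale = np.ones_like(row)
    over = row > C
    scale[over] = C / row[over]            # rescale rows whose off-diagonal sum exceeds C
    W_new = S * scale[:, None]
    np.fill_diagonal(W_new, 1.0 - W_new.sum(axis=1))  # each row sums to one
    return W_new

def input_gains(n_ones, n_zeros, alpha):
    """Input gains from the counts of ones and zeros seen at each input (assumed Eq. (32))."""
    n_ones = np.asarray(n_ones, dtype=float)
    total = n_ones + np.asarray(n_zeros, dtype=float)
    return alpha * n_ones / np.maximum(total, 1.0)  # guard (an assumption) for empty counts

# Usage: start uncoupled and present 1000 twice as often as 0110.
W = np.eye(4)
for _ in range(300):
    for pattern in ([1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 1, 0]):
        W = learn_connections(W, pattern, eta=0.001, C=0.1)
```

In this sketch only units 2 and 3 (the two ones of 0110) become correlated, and their mutual connection saturates at $C$; the pattern 1000 has a single one and therefore builds no off-diagonal connection.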

The node dynamics is

$$y_i(t+1) = f\left( \sum_j W_{ij}(t)\, y_j(t) + W_{i0}\, x_i(t) \right) \qquad (33)$$

$f$ being the function depicted in Fig. 17.

The learning process starts with $W_{ii} = 1$ and $W_{ij} = 0$ for $i \neq j$. Each unit
then has an independent absolutely continuous invariant measure, which is the
Lebesgue measure in $[0, C]$ and zero outside. When the $W_{ij}$ ($i \neq j$) become
different from zero but the inputs $x_i$ are still zero, all variables $y_i$ stay in the
interval $[0, C]$ because of the convex linear combination of inputs imposed by
the normalization of the $W_{ij}$'s. When some inputs $x_i$ are $\neq 0$, there is a finite
probability for an irregular burst in the interval $[C, 1]$ of some of the node
variables, with reinjection into $[0, C]$ whenever the iterate falls on the interval
$[\frac{1}{2}, \frac{1}{2} + \frac{C}{2}]$.

The bursts in the interval $[C, 1]$, in response to some of the input patterns, are
the recognition mechanism of the network. Since the basal chaotic dynamics ensures
a uniform covering of the interval $[0, C]$, the timing of the onset of the bursts
depends only on the correlation probability and on the clock time of the discrete
Figure 17: The node dynamics function



dynamics. We understand therefore why a chaos-based network may have a 
recognition time faster than an attractor network. 
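A minimal simulation sketch of the node dynamics follows. Fig. 17 is not reproduced here, so the map `node_map` is an assumption: it uses $u \to 2u \pmod{C}$ on $[0, C]$ and $u \to 2u \pmod{1}$ above $C$, the two branches that the burst-probability discussion names as dominant. The helper names and the random initial condition are hypothetical.

```python
import numpy as np

def node_map(u, C):
    """Assumed node function: 2u mod C on [0, C], 2u mod 1 above C."""
    u = np.asarray(u, dtype=float)
    return np.where(u <= C, np.mod(2.0 * u, C), np.mod(2.0 * u, 1.0))

def network_step(y, x, W, W0, C):
    """One step of y_i(t+1) = f(sum_j W_ij y_j(t) + W_i0 x_i(t))."""
    return node_map(W @ y + W0 * x, C)

def burst_fraction(W, W0, x, C, steps=500, seed=0):
    """Fraction of time steps each node spends above C for a fixed input pattern."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = rng.uniform(0.0, C, size=len(x))   # start inside the basal interval
    bursts = np.zeros(len(x))
    for _ in range(steps):
        y = network_step(y, x, W, W0, C)
        bursts += y > C
    return bursts / steps

# Usage: with no input and no coupling, every node stays in the basal interval.
quiet = burst_fraction(np.eye(3), np.ones(3), np.zeros(3), C=0.1)
```

One caveat on the sketch itself: the $2u \pmod 1$ branch shifts bits out of a binary floating-point number, so long runs in double precision would need a small noise term or higher-precision arithmetic to imitate the ideal map.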

Learning, invariant measures and simulations 

Both the connection strengths and the nature of the bursts, for a given set 
of W's and an applied input pattern, may be estimated in probability. 

In Eqs. (29)-(30), either the node $i$ is not correlated to any other node, and then
all off-diagonal elements are zero, or, as soon as the input patterns begin to
correlate the node $i$ with any other node, the off-diagonal elements start to grow
and only the second case (Eq. (30)) needs to be considered. Let the learning gain
be small. Then, in first order in $\eta$, we have

$$W_{ij}(t+1) = W_{ij}(t) + \eta\, x_i(t) x_j(t) - \frac{\eta}{C}\, W_{ij}(t) \sum_{k \neq i} x_i(t) x_k(t)$$

For $N$ learning steps, in first order in $\eta$,

$$W_{ij}(t+N) = W_{ij}(t) + \eta \sum_{n=0}^{N-1} x_i(t+n) x_j(t+n) - \frac{\eta}{C}\, W_{ij}(t) \sum_{k \neq i} \sum_{n=0}^{N-1} x_i(t+n) x_k(t+n)$$

Denoting by $p_{ij}(1)$ the probability for the occurrence, in the input patterns, of
a one both in the $i$ and the $j$ positions, the above equation has the stationary
solution

$$W_{ij} = C\, \frac{p_{ij}(1)}{\sum_{k \neq i} p_{ik}(1)}$$
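As a worked check, the stationary weights can be computed directly from the pattern statistics, assuming the stationary condition is that the averaged first-order increment vanishes, $p_{ij}(1) = \frac{1}{C} W_{ij} \sum_{k \neq i} p_{ik}(1)$. The three patterns and their probabilities below are hypothetical, chosen only so that one unit has two correlated partners; `stationary_weights` and `mean_drift` are illustrative names.

```python
import numpy as np

def stationary_weights(patterns, probs, C):
    """Off-diagonal stationary weights W_ij = C p_ij / sum_{k != i} p_ik."""
    X = np.asarray(patterns, dtype=float)
    p = np.asarray(probs, dtype=float)
    P = (X * p[:, None]).T @ X             # P[i, j] = probability of ones at both i and j
    np.fill_diagonal(P, 0.0)               # only i != j enters the learning law
    row = P.sum(axis=1)
    W = np.zeros_like(P)
    mask = row > 0                         # uncorrelated units keep zero off-diagonal weights
    W[mask] = C * P[mask] / row[mask][:, None]
    return W, P

def mean_drift(W, P, C):
    """Averaged first-order increment per unit eta; zero at the stationary solution."""
    return P - W * P.sum(axis=1)[:, None] / C

# Hypothetical pattern statistics: 1100 (prob 0.5), 0110 (prob 0.3), 0011 (prob 0.2).
W_stat, P = stationary_weights(
    [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]], [0.5, 0.3, 0.2], C=0.1
)
```

For unit 2 in this example the budget $C$ is shared between its two partners in proportion to how often each co-occurs with it.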

Now we establish an equation for the burst probabilities. Consider the case
where $C$ is much smaller than one, that is, the basal chaos is of low intensity. In
this case, because of the normalization chosen for $W_{ii}$, the dynamics inside the
interval $[0, C]$ is dominated by $y \to 2y \pmod{C}$ and in the interval $[C, 1]$ by
$y \to 2y \pmod{1}$. Hence, to a good approximation, we may assume uniform probability
measures for the motion inside each one of the intervals. Denoting the interval
$[0, C]$ as the state 1 and the interval $[C, 1]$ as the state 2, the dynamics of
each node is a two-state Markov process with transition probabilities between
the states corresponding to the probabilities of falling in some subintervals of
the intervals $[0, C]$ and $[C, 1]$. Namely, the probability $p(2 \to 1)$ equals the
probability of falling in the reinjection interval $[\frac{1}{2}, \frac{1}{2} + \frac{C}{2}]$ and the
probability $p(1 \to 2)$ that of falling near the point $C$ at a distance smaller than
the off-diagonal excitation.

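$$p_i(2 \to 1)_t = \frac{C}{2(1 - C)} \qquad (34)$$

$$p_i(2 \to 2)_t = 1 - p_i(2 \to 1)_t \qquad (35)$$

$$p_i(1 \to 2)_t = \left\{ \frac{1}{W_{ii} C} \left( W_{i0}\, x_i(t) - C(1 - W_{ii}) + \sum_{j \neq i} W_{ij}\, y_j(t) \right) \right\}^{\#} \qquad (36)$$

$$p_i(1 \to 1)_t = 1 - p_i(1 \to 2)_t \qquad (37)$$

where we have used the notation

$$f^{\#} = (f \vee 0) \wedge 1$$

for functions truncated to the range [0, 1]. That is, $f^{\#} = 0$ if $f < 0$, $f^{\#} = 1$ if
$f > 1$ and $f^{\#} = f$ if $1 \geq f \geq 0$.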

The sum in the right-hand side of Eq. (36) is approximated in probability by

$$\sum_{j \neq i} W_{ij} \left( \frac{1 + C}{2}\, p_j(2)_t + \frac{C}{2} \left( 1 - p_j(2)_t \right) \right) \qquad (38)$$

where $p_j(2)_t$ denotes the probability of finding the node $j$ in the state 2 at
time $t$. The probability estimate (38), for the outputs $y_j$, assumes statistical
independence of the units. This hypothesis fails when there are synchronization
effects, which are to be expected mainly when a small group of units is strongly
correlated.

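With the probability estimate for the $y_j$'s and the detailed balance principle
it is now possible to write a self-consistent equation for the probability $p_i(2)_t$ to
find an arbitrary node $i$ in the state 2 at time $t$

$$p_i(2)_t = \frac{\left\{ \frac{1}{W_{ii} C} \left( W_{i0}\, x_i(t) - C(1 - W_{ii}) + \sum_{j \neq i} W_{ij} \left( \frac{1+C}{2}\, p_j(2)_t + \frac{C}{2} (1 - p_j(2)_t) \right) \right) \right\}^{\#}}{\left\{ \frac{1}{W_{ii} C} \left( W_{i0}\, x_i(t) - C(1 - W_{ii}) + \sum_{j \neq i} W_{ij} \left( \frac{1+C}{2}\, p_j(2)_t + \frac{C}{2} (1 - p_j(2)_t) \right) \right) \right\}^{\#} + \frac{C}{2(1 - C)}} \qquad (39)$$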



For each input pattern $x_i(t)$, one obtains an estimate for $p_i(2)_t$ by solving
Eq. (39) by iteration. We find that the solution obtained is qualitatively
similar to the numerically determined invariant measures, although it tends to
overestimate the burst excitation probabilities when they are small. This may
be understood from the synchronization effects between groups of units. When
one unit is not excited (not in state 2) the others tend also not to be excited,
hence (38) overestimates the sum $\sum_{j \neq i} W_{ij}\, y_j$.
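The iteration can be sketched as follows. The sketch assumes the two-state reading used here: a burst state leaves with probability $C / (2(1 - C))$, a basal state enters the burst state with the truncated probability of Eq. (36), and the mean output of a node is $(1+C)/2$ in state 2 and $C/2$ in state 1. The function name `burst_probabilities`, the zero starting guess and the fixed iteration count are assumptions of this example.

```python
import numpy as np

def burst_probabilities(W, W0, x, C, n_iter=200):
    """Fixed-point iteration for the probability of finding each node in state 2."""
    W = np.asarray(W, dtype=float)
    W0 = np.asarray(W0, dtype=float)
    x = np.asarray(x, dtype=float)
    diag = np.diag(W)
    off = W - np.diag(diag)                  # off-diagonal connections only
    p21 = C / (2.0 * (1.0 - C))              # assumed reinjection probability
    p = np.zeros(len(x))                     # starting guess: no node is bursting
    for _ in range(n_iter):
        mean_y = 0.5 * (1.0 + C) * p + 0.5 * C * (1.0 - p)   # estimate of y_j
        a = (W0 * x - C * (1.0 - diag) + off @ mean_y) / (diag * C)
        p12 = np.clip(a, 0.0, 1.0)           # truncation to [0, 1]
        p = p12 / (p12 + p21)                # two-state stationary probability
    return p

# Usage: two weakly coupled units, only the first one receives an input.
W_pair = np.array([[0.9, 0.1], [0.1, 0.9]])
p_pair = burst_probabilities(W_pair, np.ones(2), [1, 0], C=0.1)
```

In this usage example the second unit receives no input of its own but still acquires a nonzero burst probability through its connection to the first unit, which is the associative effect the text describes.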

We now illustrate how the network behaves as an associator and pattern 
recognizer. Consider, for display simplicity, a network of four nodes that is 
exposed during many iterations to the patterns 1000 and 0110, where the first
pattern appears twice as often as the second. After this learning period we
exposed the network to all possible zero-one input patterns for 500 time
steps each and observed the network reaction. During the recall experiment no
further adjustment of the $W_{ij}$'s is made. The result is shown in Fig. 18.

The conclusion from this and other simulations is that, according to the nature
of the learning patterns, the network acts, for the recall input patterns, as a
mixture of memory, associator and novelty filter. For example, in Fig. 18 we see
that after having learned the sequences 1000 and 0110, the network reproduces 
these patterns as a memory. The pattern 1001 is associated to the pattern 1000 
and the pattern 1110 associated to a mixture of the two learned patterns. By 
contrast the pattern 0100 is not recognized by the network which acts then as 
a novelty filter. Fig. 19 shows the invariant measures of the system (expressed 
in probability per bin) when the input patterns are 1100 and 0110. 

In conclusion, the network based on Bernoulli units, with node dynamics as 
in Fig. 17 and correlation learning described by Eqs. (29)-(32):

1. Is a chaos-based pattern recognizer with the capability of operating on 
distinct invariant measures which are selected by the input patterns; 

2. The response time of the selection is controlled by the magnitude of the 
invariant measures and the clock time of the basal chaotic dynamics; 

3. As a pattern recognizer the network is a mixture of memory, associator 
and novelty filter. This behavior, however, is sensitive to the learning algorithm
that is chosen and, for the algorithm discussed here, also to the values
of the parameters $C$ and $\eta$.

3.2 Feigenbaum networks 

In this part one studies dynamical systems composed of a set of coupled quadratic
maps

$$x_i(t+1) = 1 - \mu_* \left( \sum_{j=1}^{N} W_{ij}\, x_j(t) \right)^2 \qquad (40)$$
Figure 18: Response of a 4-nodes network after being exposed to the patterns
1000 and 0110 (C = 0.1, a = 0.001) 



with $x \in [-1, 1]$, $\sum_j W_{ij} = 1 \ \forall i$, $W_{ij} > 0 \ \forall i, j$ and $\mu_* = 1.401155\ldots$. The
value chosen for $\mu_*$ implies that, in the uncoupled limit ($W_{ii} = 1$, $W_{ij} = 0$ for $i \neq j$),
each unit transforms as a one-dimensional quadratic map at the accumulation
point of the Feigenbaum period-doubling bifurcation cascade. This system will
be called a Feigenbaum network.

The quadratic map at the Feigenbaum accumulation point is not in the class
of chaotic systems (in the sense of having positive Lyapunov exponents); however,
it shares with them the property of having an infinite number of unstable
periodic orbits. Therefore, before the interaction sets in, each elementary map
possesses an infinite diversity of potential dynamical behaviors. As we will show 
later, the interaction between the individual units is able to selectively stabilize 
some of the previously unstable periodic orbits. The selection of the periodic 
orbits that are stabilized depends both on the initial conditions and on the 
intensity of the interaction coefficients Wij . As a result Feigenbaum networks 
appear as systems with potential applications in the fields of control of chaos, 
information processing and as models of self-organization. 

Control of chaos or of the transition to chaos has been, in recent years, a 
very active field (see for example Ref. E9] and references therein). Several methods
were developed to control the unstable periodic orbits that are embedded
within a chaotic attractor. With a way to select and stabilize these orbits at
will, we would have a device with infinite storage capacity (or infinite pattern
discrimination capacity). However, an even better control might be achieved if, 
instead of an infinite number of unstable periodic orbits, the system possesses an 
infinite number of periodic attractors. The basins of attraction would evidently 
be small but the situation is in principle more favorable because the control 
need not be as sharp as before. As long as the system is kept in a neighborhood 
of an attractor the uncontrolled dynamics itself stabilizes the orbit. 

The creation of systems with infinitely many sinks near an homoclinic tan- 
gency was discovered by Newhouse Q and later studied by several other au- 
thors § § § § |§. In the Newhouse phenomenon infinitely many at- 
tractors may coexist but only for special parameter values, namely for a residual 
subset of an interval. Another system, different from the Newhouse phenomenon,
which also displays many coexisting periodic attractors is a rotor map with a
small amount of dissipation [47].

Here one shows that for a Feigenbaum system with only two units and sym- 
metrical couplings one obtains a system which has an infinite number of sinks 
for an open set of coupling parameters. Then one also analyzes the behavior 
of a Feigenbaum network in the limit of a very large number of units. A mean 
field analysis shows how the interaction between the units may generate distinct 
periodic orbit patterns throughout the network. 

3.2.1 A simple system with an infinite number of sinks 

Consider two units with symmetrical positive couplings ($W_{12} = W_{21} = c > 0$)

$$\begin{aligned} x_1(t+1) &= 1 - \mu_* \left( (1 - c)\, x_1(t) + c\, x_2(t) \right)^2 \\ x_2(t+1) &= 1 - \mu_* \left( c\, x_1(t) + (1 - c)\, x_2(t) \right)^2 \end{aligned} \qquad (41)$$

The mechanism leading to the emergence of periodic attractors from a system 
that, without coupling, has no stable finite-period orbits is the permanence of 
the unstabilized orbits in a flip bifurcation and the contraction effect introduced 
by the coupling. The structure of the basins of attraction is also understood 
from the same mechanism. The result isp§|: 

For sufficiently small $c$ there is an $N$ such that the system (41) has stable
periodic orbits of all periods $2^n$ for $n > N$.
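A small numerical experiment along these lines can be sketched as follows. The period detector is a plain recurrence test with a tolerance, not a proof of stability, and its parameters (`transient`, `max_period`, `tol`) are hypothetical defaults chosen for illustration. `MU_STAR` is the truncated value quoted in the text.

```python
MU_STAR = 1.401155  # Feigenbaum accumulation point, truncated as in the text

def step_pair(x1, x2, c, mu=MU_STAR):
    """One iteration of the two symmetrically coupled quadratic maps."""
    u1 = (1.0 - c) * x1 + c * x2
    u2 = c * x1 + (1.0 - c) * x2
    return 1.0 - mu * u1 * u1, 1.0 - mu * u2 * u2

def detect_period(x1, x2, c, mu=MU_STAR, transient=100000, max_period=1024, tol=1e-9):
    """Smallest k <= max_period at which the orbit returns within tol, else None."""
    for _ in range(transient):               # discard the transient
        x1, x2 = step_pair(x1, x2, c, mu)
    ref = (x1, x2)
    for k in range(1, max_period + 1):
        x1, x2 = step_pair(x1, x2, c, mu)
        if abs(x1 - ref[0]) < tol and abs(x2 - ref[1]) < tol:
            return k
    return None

# Usage: scan a few initial conditions at small coupling and collect the periods found.
periods = {detect_period(0.1 * k, -0.05 * k, c=0.01, transient=20000) for k in range(1, 4)}
```

Which periods appear depends on the initial condition, since each sink has its own basin of attraction; a `None` entry means that no recurrence was detected within `max_period` steps at the chosen tolerance.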

3.2.2 Feigenbaum networks with many units 

Mean-field analysis 

For the calculations below it is convenient to use as variable the net input
to the units, $y_i = \sum_j W_{ij}\, x_j$. Then Eq. (40) becomes

$$y_i(t+1) = 1 - \mu_* \sum_{j=1}^{N} W_{ij}\, y_j^2(t) \qquad (42)$$
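The change of variable relies on the rows of $W$ summing to one, and can be checked numerically. In the sketch below the matrix, the initial state and the number of steps are arbitrary hypothetical choices; `step_x` iterates the unit states and `step_y` iterates the net inputs.

```python
import numpy as np

MU_STAR = 1.401155  # truncated value quoted in the text

def step_x(x, W, mu=MU_STAR):
    """Unit states: x_i(t+1) = 1 - mu (sum_j W_ij x_j(t))^2."""
    return 1.0 - mu * (W @ x) ** 2

def step_y(y, W, mu=MU_STAR):
    """Net inputs: y_i(t+1) = 1 - mu sum_j W_ij y_j(t)^2."""
    return 1.0 - mu * (W @ (y ** 2))

rng = np.random.default_rng(1)
W = rng.uniform(0.0, 1.0, size=(5, 5))
W /= W.sum(axis=1, keepdims=True)          # rows sum to one, as the model requires
x = rng.uniform(-1.0, 1.0, size=5)
y = W @ x                                  # net input at time zero
for _ in range(20):
    x = step_x(x, W)
    y = step_y(y, W)
max_gap = np.max(np.abs(W @ x - y))        # the two descriptions should agree
```

The two trajectories agree up to rounding because $\sum_j W_{ij} (1 - \mu_* y_j^2) = 1 - \mu_* \sum_j W_{ij} y_j^2$ when $\sum_j W_{ij} = 1$.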
For practical purposes some restrictions have to be put on the range of values
that the connection strengths may take. For information processing (pattern 
storage and pattern recognition) it is important to preserve, as much as possible, 
the dynamical diversity of the system. That means, for example, that a state 
with all the units synchronized is undesirable insofar as the effective number of 
degrees of freedom is drastically reduced. From 

$$\delta y_i(t+1) = -2 \mu_*\, y(t) \left( W_{ii}\, \delta y_i(t) + \sum_{j \neq i} W_{ij}\, \delta y_j(t) \right) \qquad (43)$$

one sees that instability of the fully synchronized state implies $|2 \mu_* y(t) W_{ii}| > 1$.
Therefore, the interesting case is when the off-diagonal connections are suffi- 
ciently small to insure that 

$$W_{ii} > \frac{1}{\mu_*} \qquad (44)$$

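For large $N$, provided there is no large scale synchronization effect, a mean-field
analysis might be appropriate, at least to obtain qualitative estimates
on the behavior of the network. For the unit $i$ the average value
$\langle 1 - \mu_* \sum_{j \neq i} W_{ij}\, y_j^2(t) \rangle$ acts like a constant and the mean-field dynamics is

$$z_i(t+1) = 1 - \mu_{i,\mathrm{eff}}\, z_i^2(t) \qquad (45)$$

where

$$z_i = \frac{y_i}{\langle 1 - \mu_* \sum_{j \neq i} W_{ij}\, y_j^2 \rangle} \qquad (46)$$

and

$$\mu_{i,\mathrm{eff}} = \mu_* W_{ii}\, \langle 1 - \mu_* \sum_{j \neq i} W_{ij}\, y_j^2 \rangle \qquad (47)$$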

$\mu_{i,\mathrm{eff}}$ is the effective parameter for the mean-field dynamics of unit $i$. From
(44) and (47) it follows that $\mu_* W_{ii} > \mu_{i,\mathrm{eff}} > \mu_* W_{ii} (2 - \mu_*)$. The conclusion
is that the effective mean-field dynamics always corresponds to a parameter
value below the Feigenbaum accumulation point; therefore, one expects the
interaction to stabilize the dynamics of each unit in one of the $2^n$-periodic
orbits. On the other hand, to keep the dynamics inside an interesting region we
require $\mu_{i,\mathrm{eff}} > \mu_2 = 1.3681$, the period-2 bifurcation point. With the estimate
$\langle y^2 \rangle = \frac{1}{2}$ one obtains

$$\mu_* W_{ii} \left( 1 - \frac{\mu_*}{2} (1 - W_{ii}) \right) > \mu_2 \qquad (48)$$



which, together with (44), defines the interesting range of parameters for $W_{ii}$.
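To get a feeling for the numbers, the bound (48) can be solved for $W_{ii}$. The sketch assumes the form $\mu_* W_{ii} (1 - \mu_* \langle y^2 \rangle (1 - W_{ii})) > \mu_2$ with $\langle y^2 \rangle = 1/2$ and uses the truncated constants quoted in the text; `mu_eff` and `min_w_ii` are hypothetical helper names.

```python
import math

MU_STAR = 1.401155   # accumulation point, truncated as in the text
MU_2 = 1.3681        # lower limit of the interesting region, as quoted in the text

def mu_eff(w_ii, mean_y2=0.5, mu=MU_STAR):
    """Effective mean-field parameter for a given diagonal connection W_ii."""
    return mu * w_ii * (1.0 - mu * mean_y2 * (1.0 - w_ii))

def min_w_ii(mu=MU_STAR, mu_low=MU_2, mean_y2=0.5):
    """Positive root of a w^2 + b w - mu_low = 0, i.e. where mu_eff(w) = mu_low."""
    a = mu * mu * mean_y2
    b = mu * (1.0 - mu * mean_y2)
    return (-b + math.sqrt(b * b + 4.0 * a * mu_low)) / (2.0 * a)

w_min = min_w_ii()
```

With these inputs the root comes out at roughly $W_{ii} \approx 0.986$, so under the estimate $\langle y^2 \rangle = 1/2$ the off-diagonal connections of a unit can sum to little more than one percent before the effective parameter drops below $\mu_2$.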
Feigenbaum networks as signal processors 
Let, for example, the $W_{ij}$ connections be constructed from an input signal
$x_i$ by a correlation learning process

$$W_{ij} \to W'_{ij} = (W_{ij} + \eta\, x_i x_j)\, e^{-\gamma} \quad \text{for } i \neq j$$

The dynamical behavior of the network, at a particular time, will reflect the 
learning history, that is, the data regularities, in the sense that Wij is being 
structured by the patterns that occur more frequently in the data. The decay
term $e^{-\gamma}$ ensures that the off-diagonal terms remain small and that the
network structure is determined by the most frequent recent patterns. Alternatively,
instead of the decay term, we might use a normalization method and the
connection structure would depend on the weighted effect of all the data.
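A minimal sketch of the decay variant of the learning process is given below. It assumes the update $W_{ij} \to (W_{ij} + \eta x_i x_j) e^{-\gamma}$ for $i \neq j$ and, as a further assumption, resets the diagonal so that each row sums to one, in line with the normalization of the Feigenbaum network. The function name and the parameter values are hypothetical.

```python
import numpy as np

def decay_learning_step(W, x, eta, gamma):
    """Correlation learning with decay for the off-diagonal connections."""
    x = np.asarray(x, dtype=float)
    W_new = (np.asarray(W, dtype=float) + eta * np.outer(x, x)) * np.exp(-gamma)
    np.fill_diagonal(W_new, 0.0)                        # keep only i != j terms
    np.fill_diagonal(W_new, 1.0 - W_new.sum(axis=1))    # assumed row normalization
    return W_new

# Usage: a signal that always shows the same pattern.
eta, gamma = 0.01, 0.1
W = np.eye(3)
for _ in range(2000):
    W = decay_learning_step(W, [1, 1, 0], eta, gamma)
```

For a pattern that is always present the connection settles at $\eta / (e^{\gamma} - 1)$, which shows how the ratio of $\eta$ to $\gamma$ keeps the off-diagonal terms small.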

In the operating mode described above the network acts as a signal identi- 
fier. For example if the signal patterns are random, there is little correlation 
established and all the units operate near the Feigenbaum point. Alternatively 
the learning process may be stopped at a certain time and the network then 
used as a pattern recognizer. In this latter mode, whenever the pattern $\{x_i\}$
appears, one makes the replacement

$$\begin{aligned} W_{ij} &\to W'_{ij} = W_{ij}\, x_i x_j \quad \text{for } i \neq j \\ W_{ii} &\to W'_{ii} = 1 - \sum_{j \neq i} W'_{ij} \end{aligned}$$

Therefore, if $W_{ij}$ was $\neq 0$ but either $x_i$ or $x_j$ is $= 0$, then $W'_{ij} = 0$. That is, the
correlation between nodes $i$ and $j$ disappears and the effect of this connection
on the lowering of the periods vanishes.

If both $x_i$ and $x_j$ are one, then $W'_{ij} = W_{ij}$ and the effect of this connection
persists. Suppose however that for all the $W_{ij}$'s different from zero either $x_i$ or $x_j$
is equal to zero. Then the correlations are totally destroyed and the network
comes back to the uncorrelated (nonperiodic) behavior. This case is what is
called a novelty filter. Conversely, by displaying periodic behavior, the network 
recognizes the patterns that are similar to those that, in the learning stage, 
determined its connection structure. Recognition and association of similar 
patterns is then performed Gq]. 
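The recall-mode replacement can be illustrated with a short sketch, assuming that the off-diagonal connections are masked by the product $x_i x_j$ and that the diagonal is then reset so each row sums to one. The function name `recall_weights` and the three-unit matrix are hypothetical.

```python
import numpy as np

def recall_weights(W, x):
    """Mask the learned connections with a zero-one input pattern."""
    x = np.asarray(x, dtype=float)
    W_new = np.asarray(W, dtype=float) * np.outer(x, x)   # W'_ij = W_ij x_i x_j
    np.fill_diagonal(W_new, 0.0)
    np.fill_diagonal(W_new, 1.0 - W_new.sum(axis=1))      # W'_ii = 1 - sum_{j != i} W'_ij
    return W_new

# A learned structure in which units 1 and 2 are correlated.
W_learned = np.array([[0.95, 0.05, 0.0],
                      [0.05, 0.95, 0.0],
                      [0.0,  0.0,  1.0]])
W_known = recall_weights(W_learned, [1, 1, 0])   # pattern seen during learning
W_novel = recall_weights(W_learned, [1, 0, 1])   # pattern that breaks the correlation
```

For the known pattern the coupling survives, so the units can settle on periodic orbits; for the novel pattern the matrix reduces to the identity and every unit returns to the uncoupled dynamics at the accumulation point.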
Figure 19: Invariant measures for the input patterns (a)1000 and (b)0110